tReasure: R-based GUI package analyzing tRNA expression profiles from small RNA sequencing data

Background Recent deep sequencing technologies have proven to be valuable resources to gain insights into the expression profiles of diverse tRNAs. However, despite these technologies, the association of tRNAs with diverse diseases has not been explored in depth because analytical tools are lacking. Results We developed a user-friendly tool, tRNA Expression Analysis Software Utilizing R for Easy use (tReasure), to analyze differentially expressed tRNAs (DEtRNAs) from deep sequencing data of small RNAs using R packages. tReasure can quantify individual mature tRNAs, isodecoders, and isoacceptors. By adopting stringent mapping strategies, tReasure supports the precise measurement of mature tRNA read counts. The whole analysis workflow for determining DEtRNAs (uploading FASTQ files, removing adapter sequences and poor-quality reads, mapping and quantifying tRNAs, filtering out low count tRNAs, determining DEtRNAs, and visualizing statistical analysis) can be performed with the tReasure package. Conclusions tReasure is an open-source software available for download at https://treasure.pmrc.re.kr and will be indispensable for users who have little experience with command-line software to explore the biological implication of tRNA expression. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04691-1.


Introduction
tReasure (tRNA Expression Analysis Software Utilizing R for Easy use) is a user-friendly tool for the analysis of tRNA expression from deep-sequencing data of small RNAs using R packages. tReasure package includes several RNA-seq R packages, which are available in www.bioconductor.org. tReasure covers the whole analysis workflow of high throughput sequencing experiments to identify and visualize of differentially expressed tRNAs using FASTQ File format.
tReasure is a package for the R computing environment; therefore, you must install R and Rstudio (https://rstudio.com) before installing tReasure. tReasure requires the gwidget2 graphical library to run and several additional packages for the analysis of RNA-seq.

➢ Preliminaries
• For Window: User need to install Rtools.

User Interface
On the main page of tReasure, Tabs for analysis were indicated below; Introduction, Uploading Samples, Quality Control, Alignment & Counting, filtering, DetRNA Detection, Visualization. Each tab corresponds to a particular step of the analysis workflow. Left panel contains the user-defined parameters, Right panel represents the output of each process, and Bottom panel shows analysis progress ( Figure 1). Above all, prepare small RNA-seq dataset formatted FASTQ. We provided an example dataset (smallRNA-seq data, GSE68085) [1] for practicing the analysis. The example dataset contains 114 breast tissues of small RNA-seq data which is the total of 103 tumors and 11 normal.

Tab "Uploading Samples"
Before starting analysis, create a folder (e.g., named "BCproject") on your local computer and move the dataset to your local folder. This folder will be set as a working directory and as a storage of outputs, later.

How to make a sample list
① Click "Open" button on the Left panel for selecting the directory of raw FASTQ files (Figure2).     The sample list is displayed on the Right panel and the list is saved in a folder named as "sample.txt" at the same time (i.e., /data6/BCproject/sa mple.txt).
FileName: Paths and names of the files containing the raw data. The con tents generated automatically according to the information of files.

SampleName:
The sample names of each sequence data and create fro m raw data filename. The contents generated automatically according t o the information of files.

Group:
The group information of samples. There are filled with one of "c ontrol', "test" or "NA" by default.
Batch: A batch information of samples. There are filled with "NA" by def ault. Note. There are two methods to modify those.
First, you can directly select the group information and add the batch information on the Right panel. The revised sample list is automatically saved as "sample.txt" ( Figure 6).
Second, you can modify a saved file (i.e., "sample.txt' of ④) using Microsoft Excel. After modifying the sample list in Excel, you must save it as text format (tab delimited) without changing the file name and sample name. Then, click "Open" in the [*OPTION] box on the Left panel ( Figure 7).  Caution. There should be at least two replicates in one group for statistical analysis of differential gene expression.
Caution. Don't click the "RUN" button again after finishing modification. Move on to the next step ("Quality Control" tab). If user click "RUN" button after modification, it can reset the sample list.

Working directory
Once you click "RUN" (④) button for making sample list, the selected folder is set as the working directory ($WORKDIR) automatically (i.e., $WORKDIR = /data6/BCp roject). At the same time, four subdirectories will be created to save the outputs o f subsequence procedures such as $WORKDIR/pre, $WORKDIR/post, $WORKDIR/r c and $WORKDIR/stat/plot ( Figure 8).

Tab "Quality Control"
As a next step, you need to remove adapter sequence and poor-quality reads. Those are performed using preprocessReads function from the package QuasR [2].

Workflow of analysis
First, check the adapter information in small RNA-seq data. tReasure provides four options of adapters (Illumina smallRNA 3' adapter, Illumina universal adapter, SOLiD adapter, and No adapter) ( Figure 9). Second, choose the threshold value of Q-score (quality score) to filter out low quality reads. tReasure provides two values ("25" and "30") for minimum quality threshold. Regarding minimum length, tReasure provides "10" as a threshold for optimal detection of tRNAs from small RNA-seq.
① Select the user-defined parameters on Left panel. Figure 9 is an example.  Each column represents individual small RNA seq data, they have four kinds of QC information (totalSequences, matchTo3pAdapter, tooShort, and totalPassed).
totalSequences: total number of reads matchTo3pAdapter: number of reads that matched to the 3' adapters tooShort: number of reads that were too short to analyze totalPassed: number of reads that passed the filtering step  Trimmed FASTQ files are saved in that directory "_trim.fastq" (Figure 11).

Tab "Alignment & Read Counting"
After QC, tReasure provides read alignments against genome assembly using Rbowtie aligner based on qAlign function of QuasR packages [2]. Rsamtools [3] package is used for filtering and counting of reads.

tRNA mapping strategy
tReasure performs a specific mapping method for tRNA genes, which is modified a previous method [4]. In brief, an artificial genome is generated by masking all annotated tRNA genes and adding pre-tRNA genes (i.e., tRNA genes with 3' and 5' genomic flanking regions) as extra chromosomes. tRNA annotations are obtained by using tRNAscan-SE [5], which predicted all tRNA sequence. Upon mapping to this artificial genome by Rbowtie, sequence reads that map to the tRNA-masked chromosomes or to the tRNA flanking regions are filtered out to remove non-tRNA reads and unmatured-tRNA reads, respectively.
tRNAs without flanking region are transformed to mature tRNAs by appending 3' CCA tails and removing introns. The subset of filtered reads from the first mapping is aligned against the mature tRNAs using Rbowtie.
tReasure provides artificial genome and mature tRNAs sequence of 4,781 species (540 eukarya, 4,024 bacteria, and 217 archaea). When starting alignment, the genome files of your choice are automatically downloaded from tReasure webserver.

Quantification of tRNAs
tRNAscan-SE, a frequently used tool to predict tRNA genes, is provided a score assigned to each putative tRNA gene. The genes with high score (>50) are likely bona fide tRNA genes, while those ranked with a low score are likely pseudogenes [5]. And tRNAscan-SE determines the "high confidence" set of genes that are most likely function in the translation process, by assessing a combination of domainspecific, isotype-specific, and secondary structure scores. As a benchmark, tReasure counts and uses only the cytosolic and high confidence tRNAs of all predicted them.
The identity of each tRNA is defined by its corresponding amino acid and by its ant icodon sequence. tRNAs charged with the same amino acid are isoacceptor tRNAs (e.g., tRNA-Arg), and tRNAs with the same anticodon sequence are known as isod ecoder tRNAs (e.g., tRNA-Arg-TCT). In this work, tReasure combines tRNA genes h aving the same mature tRNA sequence into tRNA families. In other words, tReasur e quantify the mapped reads of isodecoder genes that a single isodecoder set wer e assigned to individual tRNA genes (e.g., tRNA-Arg-CCG-2-1), and multiple tRNA g enes with identical sequences were assigned to a single "tRNA family" (e.g., tRNA-Arg-ACG-1) [6] (Figure 12). By using these concepts, tReasure is capable of detecti ng at a broad level from individual to isodecoder, isoacceptor.

Workflow of alignment and read counting
First, select genome assembly according to your studies.

Figure 16. Downloading of files for alignment
Note. Before aligning the small RNA seq reads against a reference genome, it is necessary to do indexing on the genome. To reduce execution time, we provide the pre_built index of eukaryotic genome, which is also downloaded with genome.
 tReasure goes through three steps automatically one by one: pre-mapping, postprocessing, and counting of reads.
After first pre-mapping step, the BAM files are saved in subdirectory ("$WORKDIR/pre"). And the summary of alignment is also saved named as "preprocessing_algin_stat.txt", which contains the size of the target sequence as well as the number of mapped/unmapped reads for each sequence file (Figure 17). Figure 17. The output of pre-mapping step After postprocessing step, results are saved in subdirectory ("$WORKDIR/post"). There are FASTQ files for the secondary library of mature tRNAs named "_mature" to the end of the filename, and the BAM files of outputs of mapping against mature tRNAs. Likewise, the summary of alignment is saved as "postprocessing_align_stat.txt" (Figure 18). After counting of reads, the number of alignments in mature tRNAs is quantified and the result tables (tRNAs in rows and samples in columns) are produced for further analysis. There are three tables: individual, isodecoder, and isoacceptor levels of tRNAs ( Figure 19).
The tables are saved in subdirectory ("$WORKDIR/rc") as below: readcount_trnas.txt readcount_isodecoders.txt readcount_isoaccepters.txt Figure 19. The output of read counting Note. The process of status is displayed on the Bottom panel.

Tab "Filtering"
Before the statistical analysis, tReasure provides the function of filtering out the normalized genes having low read counts. The counts per gene were normalized to CPM (counts per million) using cpm function of edgeR packages [7].

Workflow of filtering
tReasure supports filtering out tRNA genes that does not have at least 'm' CPM value (0 to 10) in at least 'n' samples (0 to 100 %).   The summary of filtering displays on the Right panel for "individual", "isodecoder", and "isoacceptor" levels of tRNAs ( Figure 21). The tables save in subdirectory ("$WORKDIR/rc") as named files as below.
In tReasure, DEseq2 implements the statistical test using Walt test, while EdgeR implements likelihood ratio tests or quasi-likelihood F-tests as well as exact statistical methods for differential expression. Limma implements the empirical bayes statistics test method.
For multiple correction, tReasure provides three methods of FDR, Bonferroni correction, and Benjamini-Hochberg. For determining significance of different expression, adjusted p-value provides three value (0.001, 0.05 and 0.01) and the value of log2 fold-change from 0 to 2 increasing by 0.5.

Workflow of DEtRNA Detection
① Choose statistical method and set the statistical parameters.
As an example, quasi-likelihood F-test of edgeR was selected for statistical test (Figure 22).
 The results are display on Right panel and saved in subdirectory ($WORKDIR/stat/) named "stat_" as below. The data contains the value of logFC, logCPM, F-static, raw, and three adjusted p-value (FDR, Bonferroni, and Benjamini) for each tRNA (Figure 23).
stat_trna_list.txt stat_isodecoder_list.txt stat_isoacceptor_list.txt ③ Choose one adjustment method for multiple correction and set the threshold value in the Left panel ( Figure 24) for detecting significant differential expression.
 Filtered tRNAs are automatically reflected and showed on the Right panel ( Figure 24). Also, filtered data is saved in subdirectory ($WORKDIR/stat/) named "DE" as below.

Figure 24. Statistical significance results
Note. The process of status is displayed on the Bottom panel.
The plots generated in a previous step are displayed on windows and saved in the subdirectory ("$WORKDIR/stat/plot") as a png format. Upon you clicking the tap of "Visualization", there is no plot. In the Visualization tab, there are four sub-tabs: "MDS plot", "Plot_trnas", "Plot_isodecoders", and "Plot_isoacceptors". tReasure provide four kinds of plots for each tRNA levels as below.
MDS plot shows the similarity/dissimilarity of the expression profiles between the samples (Figure 25).

Figure 25. MDS plot
Volcano plot shows statistical significance (adjusted p-value) versus magnitude of fold change using the analysis results of the "individual tRNA genes".
For example, we identified 111 upregulated and 25 downregulated individual tRNAs in breast cancers compared with normal samples (Figure 26).

Figure 26. Volcano plot of individual tRNAs
Bar plot represents the frequency of significantly expressed tRNA-anticodons (adjusted p-value) using the results of the "isodecoders".
For example, we identified 69 upregulated and 14 downregulated isodecoders in breast cancers compared with normal samples (Figure 27).

Figure 27. Bar plot for isodecoders
Pyramid plot displays the frequency of significantly expressed tRNA-amino acid using the results of the "isoacceptors".
For example, we identified 10 upregulated and 7 downregulated isoacceptors in breast cancers compared with normal samples (Figure 28).

Customizing plots
Users can modify plots (size, color, and so on) by loading files of plots, named "*.RData", which are saved in the subdirectory ("$WORKDIR/stat/plot"). There are three RData files such as Volcano_Plot.RData, Pyramid_Plot.RData and Bar_Plot.RData. Each file contains one data. Frame. And one function objects to draw corresponding plot in different R session. You can modify the value in the code ( Figure 29).

[Example: customizing a volcano plot]
① Open R or Rstudio ② Load package "ggplot2" and data "Volcano_Plot.RData". Type on command window as below.
 You can find one data.frame object ("detRNA"), two values("fc", "pval"), and one function object of plot ("p") on the Global Environment panel of R (or Rstudio). "detRNA" is a table of the statistical results and two values ("fc", "pval") are the pre-defined the threshold value.
 You can find one data.frame object ("geneplot") and one function object of plot ("p") on the Global Environment panel of R (or Rstudio). You can find the "pyramid.plot(…)" in p() function.

Option
➢ Case 1. If user wants to re-analyze with new statistical analysis parameters, Re-analyze with new statistical analysis parameters should go back the step after "Alignment and Read counting" tab.
① Click "Filtering" tab. Bottom sided of Left panel have option section.
② Set the previous or the folder that you want to reanalyze with new statistical analysis.
Note. The working directory should contain raw FASTQs (e.g., /data6/BCproject). Before doing statistical analysis, you could check the table of read counts by loading files on the Left panel of "Alignment and Read counting "tab (Figure18).
➢ Case 2. If user finish the read counting and couldn't finish filtering, In case user exit tReasure before filtering, user needs to reset the working directory on the Left panel of "Filtering" tab ( Figure 20).
② Reset the working directory on the bottom sided of Left panel.